Robust spoken document retrieval methods for misrecognition and out-of-vocabulary keywords
Authors
Abstract
This paper describes a Japanese spoken document retrieval system that is robust to out-of-vocabulary (OOV) words. A standard approach to spoken document retrieval is to automatically transcribe spoken documents into word sequences, which can then be matched directly against queries. With this approach, documents containing OOV words, or words misrecognized as other words, cannot be retrieved. To avoid this problem, we propose a novel spoken document retrieval method that takes OOV keywords into account. It combines two techniques. The first creates an index from the outputs of multiple recognizers, to cope with transcripts that contain misrecognized words; the index improves when the recognizers have different characteristics from one another. The second uses word-based indexing for in-vocabulary keywords and syllable-based indexing for OOV keywords, switching between them according to whether each keyword in the query is in-vocabulary or OOV. Evaluation results clearly show that this approach benefits from the advantages of both indexing methods and that the proposed technique is effective in robustly retrieving spoken documents. © 2004 Wiley Periodicals, Inc. Syst Comp Jpn, 35(14): 44–53, 2004; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.10697
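The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the index structure, the matching rule, or any scoring, so the function names, the romanized toy data, the use of syllable bigrams, and the "all n-grams must occur" rule for OOV keywords are all assumptions made for illustration.

```python
from collections import defaultdict


def ngrams(seq, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def build_word_index(word_outputs):
    """Index words from every recognizer's output (assumed input format:
    {doc_id: [word list from recognizer A, word list from recognizer B, ...]})."""
    index = defaultdict(set)
    for doc_id, outputs in word_outputs.items():
        for words in outputs:  # merge the outputs of all recognizers
            for word in words:
                index[word].add(doc_id)
    return index


def build_syllable_index(syllable_outputs, n=2):
    """Index syllable n-grams; the n-gram length is an illustrative choice."""
    index = defaultdict(set)
    for doc_id, outputs in syllable_outputs.items():
        for syllables in outputs:
            for gram in ngrams(syllables, n):
                index[gram].add(doc_id)
    return index


def search(keyword, syllables, vocabulary, word_index, syllable_index, n=2):
    """Use the word index for in-vocabulary keywords, the syllable index otherwise."""
    if keyword in vocabulary:
        return set(word_index.get(keyword, set()))
    grams = ngrams(syllables, n)
    if not grams:
        return set()
    # Assumed matching rule: a document must contain every n-gram of the keyword.
    return set.intersection(*[syllable_index.get(g, set()) for g in grams])


# Hypothetical toy data: two recognizers disagree on the first word of d1.
word_outputs = {
    "d1": [["tokyo", "no", "tenki"], ["kyoto", "no", "tenki"]],
    "d2": [["eki", "no", "chikaku"]],
}
syllable_outputs = {
    "d1": [["kyo", "o", "to", "no", "te", "n", "ki"]],
    "d2": [["shi", "bu", "ya", "e", "ki"]],
}
vocabulary = {"tokyo", "kyoto", "no", "tenki", "eki", "chikaku"}

word_index = build_word_index(word_outputs)
syllable_index = build_syllable_index(syllable_outputs)

in_vocab_hits = search("kyoto", ["kyo", "o", "to"], vocabulary, word_index, syllable_index)
oov_hits = search("shibuya", ["shi", "bu", "ya"], vocabulary, word_index, syllable_index)
```

In this toy setup, "kyoto" is found in d1 only because the second recognizer's output was merged into the word index, and the OOV keyword "shibuya" is found in d2 through the syllable index. A real system would also need ranking and some tolerance for syllable recognition errors, which this sketch omits.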
Related resources
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...
Language Model Expansion Using Webdata for Spoken Document Retrieval
In recent years, there has been increasing demand for ad hoc retrieval of spoken documents. We can use existing text retrieval methods by transcribing spoken documents into text data using a Large Vocabulary Continuous Speech Recognizer (LVCSR). However, retrieval performance is severely deteriorated by recognition errors and out-of-vocabulary (OOV) words. To solve these problems, we previously...
A robust/fast spoken term detection method based on a syllable n-gram index with a distance metric
For spoken document retrieval, it is crucial to consider out-of-vocabulary (OOV) words and the misrecognition of spoken words. Consequently, sub-word unit based recognition and retrieval methods have been proposed. This paper describes a Japanese spoken term detection method for spoken documents that robustly considers OOV words and misrecognition. To solve the problem of OOV keywords, we use indiv...
Graph-based Document Expansion and Robust SCR Models for False Positives: Experiments at the NTCIR-12 SpokenQuery&Doc-2
In this paper, we report our experiments at the NTCIR-12 SpokenQuery&Doc-2 task. We participated in the spoken-query-driven spoken content retrieval (SQ-SCR) subtasks of SpokenQuery&Doc-2. We submitted two types of results: a conventional spoken content retrieval method (referred to as C-SCR) and an STD-based approach for SCR (referred to as STD-SCR). The latter was proposed in order to deal with spe...
Using Semantic and Phonetic Term Similarity for Spoken Document Retrieval and Spoken Query Processing
In classical Information Retrieval systems a relevant document will not be retrieved in response to a query if the document and query representations do not share at least one term. This problem is known as “term mismatch”. A similar problem can be found in spoken document retrieval and spoken query processing, where terms misrecognized by the speech recognition process can hinder the retrieval...
Journal title:
- Systems and Computers in Japan
Volume 35, Issue 14
Pages 44–53
Publication date: 2004